Abstract

We present a general framework for comparing multiple groups of documents.
A bipartite graph model is proposed where document groups are represented as
one node set and the comparison criteria are represented as the other node set.
Using this model, we present basic algorithms to extract insights into
similarities and differences among the document groups. Finally, we demonstrate
the versatility of our framework through an analysis of NSF funding programs
for basic research.

1 Introduction and Motivation

Given multiple sets (or groups) of documents, it is often necessary to compare
the groups to identify similarities and differences along different dimensions.
In this work, we present a general framework to perform such comparisons for
extraction of important insights. Indeed, many real-world tasks can be framed
as a problem of comparing two or more groups of documents. Here, we provide two
motivating examples.

1. Program Reviews. To better direct research efforts, funding organizations
such as the National Science Foundation (NSF), the National Institutes of
Health (NIH), and the Department of Defense (DoD) are often in the position of
reviewing research programs via their artifacts (e.g., grant abstracts,
published papers, and other research descriptions). Such reviews might involve
identifying overlaps across different programs, which may indicate a
duplication of effort. They may also involve the identification of unique,
emerging, or diminishing topics. A "document group" here could be defined as a
particular research program that funds many organizations, the totality of
funded research conducted by a specific organization, or all research
associated with a particular time period (e.g., fiscal year). In all cases, the
objective is to draw comparisons between groups by comparing the document sets
associated with them.

2. Intelligence. In the areas of defense and intelligence, document sets are
sometimes obtained from different sources or entities. For instance, the U.S.
Armed Forces sometimes seize documents during raids of terrorist strongholds.1
Similarities between two document sets (each captured from a different source)
can potentially be used to infer a non-obvious association between the sources.

Of course, there are numerous additional examples across many domains (e.g.,
comparing different news sources, comparing the reviews for several products,
etc.). Given the abundance of real-world applications illustrated above, it is
surprising that there are no existing general-purpose approaches for drawing
such comparisons. While there is some previous work on the comparison of
document sets (referred to as comparative text mining), these existing
approaches lack the generality to be widely applicable across different use
case scenarios with different comparison criteria. Moreover, much of the work
in the area focuses largely on the summarization of shared or unshared topics
among document groups (e.g., Wan et al. (2011), Huang et al. (2011), Campr and
Jezek (2013), Wang et al. (2012), Zhai et al. (2004)). That is, the problem of
drawing multi-faceted comparisons among the groups themselves is not typically
addressed. This motivates our development of a general-purpose model for
comparisons of document sets along arbitrary dimensions. We use this model for
the identification of similarities, differences, trends, and anomalies among
large groups of documents. We begin by formally describing our model.

2 Our Formal Model for Comparing Document Groups

As input, we are given several groups of documents, and our task is to compare
them. We now formally define these document groups and the criteria used to
compare them. Let $D = \{d_1, d_2, \ldots, d_N\}$ be a document collection
comprising the totality of documents under consideration, where $N$ is the
size. Let $D^P$ be a partition of $D$ representing the document groups.


1 http://en.wikipedia.org/wiki/Document_Exploitation_(DOCEX)
Definition 1 A document group is a subset $D_i^P \in D^P$
(where index $i \in \{1 \ldots |D^P|\}$).

Each document group in $D^P$, for instance, might represent articles associated
with a particular organization (e.g., a university), a research funding source
(e.g., an NSF or DARPA program), or a time period (e.g., a fiscal year).
Document groups are compared using comparison criteria, $D^C$, a family of
subsets of $D$.

Definition 2 A comparison criterion is a subset $D_i^C \in D^C$
(where index $i \in \{1 \ldots |D^C|\}$).

Intuitively, each subset in $D^C$ represents a set of documents sharing some
attribute. Our model allows great flexibility in how $D^C$ is defined. For
instance, $D^C$ might be defined by the named entities mentioned within
documents (e.g., each subset contains documents that mention a particular
person or organization of interest). For the present work, we define $D^C$ by
topics discovered using latent Dirichlet allocation, or LDA (Blei et al., 2003).

LDA Topics as Comparison Criteria. Probabilistic topic modeling algorithms like
LDA discover latent themes (i.e., topics) in document collections. By using
these discovered topics as the comparison criteria, we can compare arbitrary
groups of documents by the themes and subject areas comprising them. Let $K$ be
the number of topics or themes in $D$. Each document in $D$ is composed of a
sequence of words: $d_i = (s_{i1}, s_{i2}, \ldots, s_{iN_i})$, where $N_i$ is
the number of words in $d_i$ and $i \in \{1 \ldots N\}$.
$V = \bigcup_{i=1}^{N} f(d_i)$ is the vocabulary of $D$, where $f(\cdot)$ takes
a sequence of elements and returns a set. LDA takes $K$ and $D$ (including its
components such as $V$) as input and produces two matrices as output, one of
which is $\theta$. The matrix $\theta \in \mathbb{R}^{N \times K}$ is the
document-topic distribution matrix and shows the distribution of topics within
each document. Each row of the matrix represents a probability distribution.
$D^C$ is constructed using $K$ subsets of documents, each of which represents a
set of documents pertaining largely to the same topic. That is, for
$t \in \{1 \ldots K\}$ and $i \in \{1 \ldots N\}$, each subset
$D_t^C \in D^C$ comprises all documents $d_i$ where
$t = \arg\max_x \theta_{ix}$.2 Having defined the document groups $D^P$ and the
comparison criteria $D^C$, we now construct a bipartite graph model used to
perform comparisons.
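The dominant-topic assignment described above can be sketched in a few lines. This is a minimal illustration, assuming that $\theta$ is available as a list of rows (one per document) and that ties are broken in favor of the lowest topic index; the function name `build_comparison_criteria` and the toy matrix are hypothetical and not part of the original work.

```python
from collections import defaultdict


def build_comparison_criteria(theta):
    """Group document indices by their dominant topic.

    theta: list of N rows, each a list of K topic probabilities.
    Returns a dict mapping topic index t to the set of document
    indices i for which t = argmax_x theta[i][x].
    """
    criteria = defaultdict(set)
    for i, row in enumerate(theta):
        # max() returns the first maximal index, so ties go to the
        # lowest topic index (an assumption of this sketch).
        dominant = max(range(len(row)), key=row.__getitem__)
        criteria[dominant].add(i)
    return dict(criteria)


# Toy document-topic matrix with N = 4 documents and K = 3 topics.
theta = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
]
criteria = build_comparison_criteria(theta)
print(criteria)
```

Because every document receives exactly one dominant topic, the resulting subsets are disjoint and cover all documents.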

A Bipartite Graph Model. Our objective is to compare the document groups in
$D^P$ based on $D^C$. We do so by representing $D^P$ and $D^C$ as a weighted
bipartite graph, $G = (P, C, E, w)$, where $P$ and $C$ are disjoint sets of
nodes, $E$ is the edge set, and $w : E \to \mathbb{Z}^+$ is the edge weight
function. Each subset of $D^P$ is represented as a node in $P$, and each subset
of $D^C$ is represented as a node in $C$. Let $\alpha : P \to D^P$ and
$\beta : C \to D^C$ be functions that map nodes to the document subsets that
they represent. Then, the edge set $E$ is
$\{(u, v) \mid u \in P, v \in C, \alpha(u) \cap \beta(v) \neq \emptyset\}$,
and the edge weight for any two nodes $u \in P$ and $v \in C$ is
$w((u, v)) = |\alpha(u) \cap \beta(v)|$. Concisely, each weighted edge in $G$
between a document group (in $P$) and a topic (in $C$) represents the number of
documents shared between the two sets. Figure 1 shows a toy illustration of the
model. Each node in $P$ is shown in black and represents a subset of $D^P$
(i.e., a document group). Each node in $C$ is shown in gray and represents a
subset of $D^C$ (i.e., a document cluster pertaining primarily to the same
topic). Each edge represents the intersection of the two subsets it connects.
In the next section, we will describe basic algorithms on such bipartite graphs
capable of yielding important insights into the similarities and differences
among document groups.

2 $D^C$ is also a partition of $D$, when defined in this way.
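As a rough sketch of the graph construction just described, the edge set and weights can be derived directly from set intersections. The dictionary-based representation, the function name `build_bipartite_edges`, and the toy group and topic labels are assumptions made for illustration only.

```python
def build_bipartite_edges(groups, criteria):
    """Build the weighted edges of the bipartite graph G.

    groups:   dict mapping a group label to its set of document ids
    criteria: dict mapping a criterion label to its set of document ids
    Returns a dict mapping (group, criterion) to the number of shared
    documents; pairs with an empty intersection get no edge.
    """
    edges = {}
    for u, docs_u in groups.items():
        for v, docs_v in criteria.items():
            shared = len(docs_u & docs_v)
            if shared > 0:
                edges[(u, v)] = shared
    return edges


# Hypothetical toy input: five documents, two groups, three topics.
groups = {"program_A": {0, 1, 2}, "program_B": {3, 4}}
criteria = {"topic_0": {0, 2, 3}, "topic_1": {1}, "topic_2": {4}}
edges = build_bipartite_edges(groups, criteria)
for (u, v), weight in sorted(edges.items()):
    print(u, v, weight)
```

When both node sets come from partitions of the same collection, as in this toy input, the edge weights sum to the number of documents.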


Figure 1: [Toy Illustration of Bipartite Graph Model.] Each black node (i.e.,
node $\in P$) represents a document group. Each gray node (i.e., node $\in C$)
represents a cluster of documents pertaining primarily to the same topic.

3 Basic Algorithms Using the Model

We focus on three basic operations in this work.

Node Entropy. Let $\vec{w}$ be a vector of weights for all edges incident to
some node $v \in P \cup C$. The entropy $H$ of $v$ is:
$H(v) = -\sum_i p_i \log_{|\vec{w}|}(p_i)$, where
$p_i = \frac{w_i}{\sum_j w_j}$ and $i, j \in \{1 \ldots |\vec{w}|\}$. A similar
formulation was employed in Eagle et al. (2010). Intuitively, if $v \in P$,
$H(v)$ measures the extent to which the document group is concentrated around a
small number of topics (lower values of $H(v)$ mean more concentrated).
Similarly, if $v \in C$, it is the extent to which a topic is concentrated
around a small number of document groups.
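The entropy computation can be sketched as follows. This is an illustrative implementation of the formula above, not the authors' code; returning 0 for a node with fewer than two incident edges is an assumption of this sketch, since a logarithm with base 1 is undefined.

```python
import math


def node_entropy(weights):
    """Entropy of a node, normalized by using log base |w|.

    weights: the positive weights of all edges incident to the node.
    The result lies in [0, 1]; lower values mean the node's weight is
    concentrated on fewer neighbors.
    """
    n = len(weights)
    if n < 2:
        # Assumption: a node with a single edge is fully concentrated.
        return 0.0
    total = sum(weights)
    entropy = 0.0
    for w in weights:
        if w > 0:
            p = w / total
            entropy -= p * math.log(p, n)
    return entropy


# A group dominated by one topic versus a group spread over four topics.
focused = node_entropy([97, 1, 1, 1])
broad = node_entropy([5, 5, 5, 5])
print(focused, broad)
```

Using $|\vec{w}|$ as the logarithm base makes values comparable across nodes of different degree, since a uniform spread over any number of neighbors yields the maximum value of 1.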

Node Similarity. Given a graph $G$, there are many ways to measure the
similarity of two nodes based on their connections. Such measures can be used
to infer similarity (and dissimilarity) among document groups. However,
existing methods are not well-suited for the task of document group comparison.
The well-known SimRank algorithm (Jeh and Widom, 2002) ignores edge weights,
and neither SimRank nor its extension, SimRank++ (Antonellis et al., 2008),
scales to larger graphs. SimRank++ and ASCOS (Chen and Giles, 2013) do
incorporate edge weights but in ways that are not appropriate for document
group comparisons. For instance, both SimRank++ and ASCOS incorporate magnitude
in the similarity computation. Consider the case where document groups are
defined as research labs. ASCOS and SimRank++ will measure large research labs
and small research labs as less similar when in fact they may publish nearly
identical lines of research. Finally, under these existing methods, document
groups sharing zero topics in common could still be considered similar, which
is undesirable here. For these reasons, we formulate similarity as follows. Let
$N_G(\cdot)$ be a function that returns the neighbors of a given node in $G$.
Given two nodes $u, v \in P$, let $L_{u,v} = N_G(u) \cup N_G(v)$ and let
$x : I \to L_{u,v}$ be the indexing function for $L_{u,v}$.3 We construct two
vectors, $a$ and $b$, where $a_k = w(u, x(k))$, $b_k = w(v, x(k))$, and
$k \in I$. Each vector is essentially a sequence of weights for edges between
$u, v \in P$ and each node in $L_{u,v}$. Similarity of two nodes is measured
using the cosine similarity of their corresponding sequences,
$\frac{a \cdot b}{\|a\| \, \|b\|}$, which we compute using a function
$sim(\cdot, \cdot)$. Thus, document groups are considered more similar when
they have similar sets of topics in similar proportions. As we will show later,
this simple solution, based on item-based collaborative filtering (Sarwar et
al., 2001), is surprisingly effective at inferring similarity among document
groups in $G$.
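A minimal sketch of this similarity function follows. It assumes that a missing edge contributes a weight of 0 to the vector (the text does not state this explicitly) and that edges are stored as a dictionary keyed by (group, topic) pairs; the lab and topic names are hypothetical.

```python
import math


def sim(edges, u, v):
    """Cosine similarity of two group nodes over their joint neighbors.

    edges: dict mapping (group, topic) to an edge weight.
    A topic adjacent to only one of the two groups contributes a
    weight of 0 for the other group (an assumption of this sketch).
    """
    nbrs_u = {c: w for (p, c), w in edges.items() if p == u}
    nbrs_v = {c: w for (p, c), w in edges.items() if p == v}
    joint = sorted(set(nbrs_u) | set(nbrs_v))  # fixed index order over L_{u,v}
    a = [nbrs_u.get(c, 0) for c in joint]
    b = [nbrs_v.get(c, 0) for c in joint]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# A large lab and a small lab with identical topic proportions,
# plus a third lab that shares no topics with either.
edges = {
    ("big_lab", "topic_0"): 100, ("big_lab", "topic_1"): 50,
    ("small_lab", "topic_0"): 10, ("small_lab", "topic_1"): 5,
    ("other_lab", "topic_2"): 20,
}
print(sim(edges, "big_lab", "small_lab"))
print(sim(edges, "big_lab", "other_lab"))
```

The toy input mirrors the two properties argued for above: groups of different size with the same topic proportions score as highly similar, and groups with no topics in common score zero.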

Node Clusters. Identifying clusters of related nodes in the bipartite graph $G$
can show how document groups form larger classes. However, we find that $G$ is
typically fairly dense. For this reason, partitioning of the one-mode
projection of $G$ and other standard bipartite graph clustering techniques
(e.g., Dhillon (2001) and Sun et al. (2009)) are rendered less effective. We
instead take a different tack and exploit the node similarities computed
earlier. We transform $G$ into a new weighted graph
$G^P = (P, E^P, w_{sim})$ where
$E^P = \{(u, v) \mid u, v \in P, sim(u, v) > \xi\}$, $\xi$ is a predefined
threshold, and $w_{sim}$ is the edge weight function (i.e., $w_{sim} = sim$).
Thus, $G^P$ is the similarity graph of document groups. $\xi = 0.5$ was used as
the threshold for our analyses. To find clusters in $G^P$, we employ the
Louvain algorithm, a heuristic method based on modularity optimization (Blondel
et al., 2008). Modularity measures the fraction of edges falling within
clusters as compared to the expected fraction if edges were distributed evenly
in the graph (Newman, 2006). The algorithm initially assigns each node to its
own cluster. At each iteration, in a local and greedy fashion, nodes are
re-assigned to clusters with which they achieve the highest modularity.

3 $I$ is the index set of $L_{u,v}$.
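The sketch below illustrates the two ingredients of this step: thresholding pairwise similarities to obtain $G^P$, and scoring a candidate partition with weighted modularity. It does not implement the Louvain algorithm itself, only the objective that Louvain greedily optimizes; the function names, the similarity values, and the candidate partitions are hypothetical.

```python
def similarity_graph(sims, threshold=0.5):
    """Keep only the pairs whose similarity exceeds the threshold."""
    return {pair: s for pair, s in sims.items() if s > threshold}


def modularity(edges, membership):
    """Weighted modularity of a partition of an undirected graph.

    edges:      dict mapping (u, v) to a weight, each pair listed once
    membership: dict mapping each node to a cluster label
    Computes the sum over clusters of (internal weight / m) minus
    (cluster strength / 2m)^2, where m is the total edge weight.
    """
    total = sum(edges.values())
    if total == 0:
        return 0.0
    internal, strength = {}, {}
    for (u, v), w in edges.items():
        strength[membership[u]] = strength.get(membership[u], 0.0) + w
        strength[membership[v]] = strength.get(membership[v], 0.0) + w
        if membership[u] == membership[v]:
            internal[membership[u]] = internal.get(membership[u], 0.0) + w
    return sum(
        internal.get(c, 0.0) / total - (s / (2 * total)) ** 2
        for c, s in strength.items()
    )


# Hypothetical pairwise similarities among four programs.
sims = {("A", "B"): 0.9, ("C", "D"): 0.8, ("B", "C"): 0.55, ("A", "D"): 0.1}
gp = similarity_graph(sims, threshold=0.5)

two_clusters = {"A": 0, "B": 0, "C": 1, "D": 1}
one_cluster = {"A": 0, "B": 0, "C": 0, "D": 0}
print(modularity(gp, two_clusters), modularity(gp, one_cluster))
```

In practice one would hand $G^P$ to an existing Louvain implementation rather than enumerate partitions by hand; the comparison here only shows that the partition separating the two tightly linked pairs scores higher than lumping all programs together.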

4 Example Analysis: NSF Grants

As a realistic and informative case study, we utilize our model to characterize
funding programs of the National Science Foundation (NSF). This corpus consists
of 132,372 grant abstracts describing awards for basic research and other
support funded by the NSF between the years 1990 and 2002 (Bache and Lichman,
2013).4 Each award is associated with both a program element (i.e., funding
source) and a date. We define document groups in two ways: by program element
and by calendar year. For comparison criteria, we used topics discovered with
the MALLET implementation of LDA (McCallum, 2002) using $K = 400$ as the number
of topics and 200 as the number of iterations. All other parameters were left
as defaults. The NSF corpus possesses unique properties that lend themselves to
experimental evaluation. For instance, program elements are not only associated
with specific sets of research topics but are named based on the content of the
program. This provides a measure of ground truth against which we can validate
our model. We structure our analyses around specific questions, which now
follow.

Which NSF programs are focused on specific areas and which are not? When
defining document groups as program elements (i.e., each NSF program is a node
in $P$), node entropy can be used to answer this question. Table 1 shows
examples of program elements most and least associated with specific topics, as
measured by entropy. For example, the program 1311 Linguistics (low entropy) is
largely focused on a single linguistics topic (labeled by LDA with words such
as "language," "languages," and "linguistic"). By contrast, the Australia
program (high entropy) was designed to support US-Australia cooperative
research across many fields, as correctly inferred by our model.


Low Entropy Program Elements
Program                                       Primary LDA Topic
1311 Linguistics                              language languages linguistic
4091 Network Infrastructure                   network connection internet

High Entropy Program Elements
Program                                       Primary LDA Topic
5912 Australia                                (many topics & disciplines)
9130 Res. Improvements in Minority Instit.    (many topics & disciplines)

Table 1: [Examples of High/Low Entropy Programs.]


Which research areas are growing/emerging? When defining document groups as
calendar years (instead of program elements), low entropy nodes in $C$ are
topics concentrated around certain years. Concentrations in later years
indicate growth. The LDA-discovered topic nanotechnology is among the lowest
entropy topics (i.e., an outlier topic with respect to entropy). As shown in
Figure 2, the number of nanotechnology grants drastically increased in
proportion through 2002. This result is consistent with history, as the
National Nanotechnology Initiative was proposed in the late 1990s to promote
nanotechnology R&D.5 One could also measure such trends using budget
allocations by incorporating the award amounts into the edge weights of $G$.

4 Data for years 1989 and 2003 in this publicly available corpus were partially
missing and omitted in some analyses.

Figure 2: [Uptrend in Nanotechnology.] Our model correctly identifies the surge
in nanotechnology R&D beginning in the late 1990s.


1245 Theoretical Physics            1182 Ecology
--------------------------------    --------------------------
1286 Elementary Particle Theory     1128 Ecological Studies
1287 Mathematical Physics           1196 Environmental Biology
1284 Atomic Theory                  1195 Ecological Research

Table 2: [Similarity Queries.] Three most similar programs to the Theoretical
Physics and Ecology programs.

Given an NSF program, to which other programs is it most similar? As described
in Section 3, when each node in $P$ represents an NSF program, our model can
easily identify the programs most similar to a given program. For instance,
Table 2 shows the top three most similar programs to both the Theoretical
Physics and Ecology programs. Results agree with intuition. For each NSF
program, we identified the top $n$ most similar programs ranked by our
$sim(\cdot, \cdot)$ function, where $n \in \{3, 6, 9\}$. These programs were
manually judged for relatedness, and the Mean Average Precision (MAP), a
standard performance metric for ranking tasks in information retrieval, was
computed. We were unsuccessful in evaluating alternative weighted similarity
measures mentioned in Section 3 due to their aforementioned issues with
scalability and the size of the NSF dataset. (For instance, the implementations
of ASCOS (Antonellis et al., 2008) and SimRank (Jeh and Widom, 2002) that we
considered are available here.6) Recall that our $sim(\cdot, \cdot)$ function
is based on measuring the cosine similarity between two weight vectors, $a$ and
$b$, generated from our bipartite graph model. As a baseline for comparison, we
evaluated two additional similarity implementations using these weight vectors.
The first measures the similarity between weight vectors using weighted Jaccard
similarity, which is $\frac{\sum_k \min(a_k, b_k)}{\sum_k \max(a_k, b_k)}$
(denoted as Wtd. Jaccard). The second measure is implemented by taking the
Spearman's rank correlation coefficient of $a$ and $b$ (denoted as Rank).
Figure 3 shows the Mean Average Precision (MAP) for each method and each value
of $n$. With the exception of the difference between Cosine and Wtd. Jaccard
for MAP@3, all performance differentials were statistically significant, based
on a one-way ANOVA and post-hoc Tukey HSD at a 5% significance level. This
provides some validation for our choice.
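For reference, the weighted Jaccard baseline can be sketched as below, assuming the two weight vectors have already been aligned over the same index set as in Section 3. The function name and the toy vectors are illustrative only.

```python
def weighted_jaccard(a, b):
    """Weighted Jaccard similarity of two non-negative weight vectors.

    Sum of element-wise minima divided by the sum of element-wise
    maxima; defined here as 0.0 when both vectors are all zeros.
    """
    denominator = sum(max(x, y) for x, y in zip(a, b))
    if denominator == 0:
        return 0.0
    return sum(min(x, y) for x, y in zip(a, b)) / denominator


# Two programs with the same topic proportions but different sizes.
large_program = [100, 50]
small_program = [10, 5]
print(weighted_jaccard(large_program, small_program))
print(weighted_jaccard(large_program, large_program))
```

Unlike cosine similarity, this measure depends on the magnitudes of the weights, so two groups with identical proportions but different sizes receive a low score in this toy case. Whether that property accounts for the differences reported in Figure 3 is not something this sketch can establish.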

5 http://en.wikipedia.org/wiki/National_Nanotechnology_Initiative
Figure 3: [Mean Average Precision (MAP).] Cosine similarity outperforms
alternative approaches.

How do NSF programs join together to form larger program categories? As
mentioned, by using the similarity graph $G^P$ constructed from $G$, clusters
of related NSF programs can be discovered. Figure 4, for instance, shows a
discovered cluster of NSF programs all related to the field of neuroscience.
Each NSF program (i.e., node) is composed of many documents.
Figure 4: [Neuroscience Programs.] A discovered cluster of program elements all
related to neuroscience.


Which pairs of grants are the most similar in the research they describe?
Although the focus of this paper is on drawing comparisons among groups of
documents, it is often necessary to draw comparisons among individual
documents, as well. For instance, in the case of this NSF corpus, one may wish
to identify pairs of grants from different programs describing highly similar
lines of research. One common approach to this is to exploit the
low-dimensional representations of documents returned by LDA (Blei et al.,
2003). Any given document $d_i \in D$ (where $i \in \{1 \ldots N\}$) can be
represented by a $K$-dimensional probability vector of topic proportions given
by $\theta_i$, the $i$th row of the document-topic matrix $\theta$. The
similarity between any two documents, then, can be measured using the distance
between their corresponding probability vectors (i.e., probability
distributions). We quantify the similarity between probability vectors using
the complement of Hellinger distance:
$H_S(d_x, d_y) = 1 - \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{K} \left(\sqrt{\theta_{xi}} - \sqrt{\theta_{yi}}\right)^2}$,
where $x, y \in \{1 \ldots N\}$. Unfortunately, identifying the set of most
similar document pairs in this way can be computationally expensive, as the
number of pairwise comparisons scales quadratically with the size of the
corpus. For the moderately-sized NSF corpus, this amounts to well over 8
billion comparisons. To address this issue, our bipartite graph model can be
exploited as a blocking heuristic using either the document groups or the
comparison criteria. In the latter case, one can limit the pairwise comparisons
to only those documents that reside in the same subset of $D^C$. For the former
case, node similarity can be used. Instead of comparing each document with
every other document, we can limit the comparisons to only those document
groups of interest that are deemed similar by our model. As an illustrative
example, out of the 665 different NSF programs covering these 132,372 grant
abstracts, the program 1271 Computational Mathematics and the program 2865
Numeric, Symbolic, and Geometric Computation are inferred as being highly
similar by our model. Thus, we can limit the pairwise comparisons to only such
document groups that are similar and likely to contain similar documents. In
the case of these two programs, the following two grants are easily identified
as being the most similar with a Hellinger similarity ($H_S$) score of 0.73
(only text snippets are shown due to space constraints):

Grant #1

Program: 1271 Computational Mathematics

Title: Analyses of Structured Computational Problems and Parallel Iterative
Algorithms.

Abstract: The main objectives of the research planned is the analysis of large
scale structured computational problems and of the convergence of parallel
iterative methods for solving linear systems and applications of these
techniques to the solution of large sparse and dense structured systems of
linear equations

Grant #2

Program: 2865 Numeric, Symbolic, and Geometric Computation

Title: Sparse Matrix Algorithms on Distributed Memory Multiprocessors.

Abstract: The design, analysis, and implementation of algorithms for the
solution of sparse matrix problems on distributed memory multiprocessors will
be investigated. The development of these parallel sparse matrix algorithms
should have an impact of challenging large-scale computational problems in
several scientific, econometric, and engineering disciplines.

Some key terms in each grant are manually highlighted in bold. As can be seen,
despite some differences in terminology, the two lines of research are related,
as matrices (studied in Grant #2) are used to compactly represent and work with
systems of linear equations (studied in Grant #1). That is, despite such
differences in terminology (e.g., "matrix" vs. "linear systems", "parallel" vs.
"distributed"), document similarity can still be accurately inferred by taking
the Hellinger similarity of the LDA-derived low-dimensional representations for
the two documents. In this way, by exploiting the group-level similarities
inferred by our model in combination with such document-level similarities, we
can more effectively "zero in" on such highly related document pairs.
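The document-level comparison and the blocking heuristic discussed in this section can be sketched as follows. This is an illustration under stated assumptions: topic vectors are given as lists of probabilities, the two blocks are lists of document indices belonging to two groups already judged similar, and the names `hellinger_similarity` and `most_similar_pair` as well as the toy vectors are hypothetical.

```python
import math


def hellinger_similarity(p, q):
    """Complement of the Hellinger distance between two distributions."""
    squared = sum((math.sqrt(x) - math.sqrt(y)) ** 2 for x, y in zip(p, q))
    return 1.0 - math.sqrt(squared) / math.sqrt(2)


def most_similar_pair(theta, block_a, block_b):
    """Compare only documents of block_a against documents of block_b.

    theta: list of topic-proportion vectors, one per document.
    Returns (score, i, j) for the best cross-block pair.
    """
    best = None
    for i in block_a:
        for j in block_b:
            score = hellinger_similarity(theta[i], theta[j])
            if best is None or score > best[0]:
                best = (score, i, j)
    return best


# Number of comparisons without blocking for the corpus size given above.
n_docs = 132372
all_pairs = n_docs * (n_docs - 1) // 2
print(all_pairs)

# Toy topic vectors: documents 0-1 form one group, documents 2-3 another.
theta = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
]
best = most_similar_pair(theta, block_a=[0, 1], block_b=[2, 3])
print(best)
```

With blocking, the cost is the product of the two block sizes rather than a quantity that grows quadratically with the whole corpus.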

5 Conclusion

We have presented a bipartite graph model for drawing comparisons among large
groups of documents. We showed how basic algorithms using the model can
identify trends and anomalies among the document groups. As an example
analysis, we demonstrated how our model can be used to better characterize and
evaluate NSF research programs. For future work, we plan on employing
alternative comparison criteria in our model such as those derived from named
entity recognition and paraphrase detection.